Topic Guide

  • Action verbs
  • Workflows
  • Strategies

R Data Workflow

Describe the daytime air temperatures at the Rice Rivers Center for February, 2014 by day of the week.

To do this, we have the following general workflow.

Combinations Yield Inferences

The manner in which we organize these action verbs yields an infinite number of combinations.

 

 

The Treachery of Images

The Pipe Operator

In R we use this grammar.

data %>% Y()

It is also possible to use this pipe character:

data |> Y()

To take the values in data and pass them as if you entered the data as the first argument to the function Y().

Chain-pipe

Pipes can be chained together into a single operation (you can mix and match pipe types but I find myself basically using the shorter one because I’m a lazy programmer).

data |>
  func1() |>
  func2() %>%
  func3() -> newData

The Data

The data we will be working with consist of data from the Rice Rivers Center.

The Raw Data

The raw data are published from a Google Spreadsheet.

The Data in R

So, let us load it in here.

url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Column Selection

Using the column numbers instead of names.

df <- rice[ c(1,3,5,13)]
summary( df )
   DateTime              PAR              WindDir             PH      
 Length:8199        Min.   :   0.000   Min.   :  0.00   Min.   :6.43  
 Class :character   1st Qu.:   0.000   1st Qu.: 37.31   1st Qu.:7.50  
 Mode  :character   Median :   0.046   Median :137.30   Median :7.58  
                    Mean   : 241.984   Mean   :146.20   Mean   :7.60  
                    3rd Qu.: 337.900   3rd Qu.:249.95   3rd Qu.:7.69  
                    Max.   :1957.000   Max.   :360.00   Max.   :9.00  
                                                        NA's   :1     

Named Column Selection

  • Column names are probably better than column numbers

  • Additional assistance from RStudio via pop-ups.

  • Longer term readability (like next Tuesday & Beyond!)

Figure 1: Popup help for column names in RStudio for a data frame in memory

Data Pliars

The dplyr library defines a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges using select, filter, arrange, mutate, group_by, and summarise functionality.

It is part of tidyverse

Select Tidy

In tidyverse we can use the column names and do not need to be quoted.

rice |>
  select( DateTime ) -> df
head( df )
# A tibble: 6 × 1
  DateTime            
  <chr>               
1 1/1/2014 12:00:00 AM
2 1/1/2014 12:15:00 AM
3 1/1/2014 12:30:00 AM
4 1/1/2014 12:45:00 AM
5 1/1/2014 1:00:00 AM 
6 1/1/2014 1:15:00 AM 

Selecting Several

The select() function allows you to choose many columns of data to work with.

rice %>%
  select( DateTime, AirTempF ) -> df 
head(df)
# A tibble: 6 × 2
  DateTime             AirTempF
  <chr>                   <dbl>
1 1/1/2014 12:00:00 AM     31.0
2 1/1/2014 12:15:00 AM     30.7
3 1/1/2014 12:30:00 AM     31.2
4 1/1/2014 12:45:00 AM     30.5
5 1/1/2014 1:00:00 AM      30.9
6 1/1/2014 1:15:00 AM      30.6

Selecting to Rearrange

There are times when we want to

names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Selecting to Rearrange

Love the everything() function.

rice |>
  select( RecordID, ODO_mgl, PH, everything() ) -> df
names( df )
 [1] "RecordID"                       "ODO_mgl"                       
 [3] "PH"                             "DateTime"                      
 [5] "PAR"                            "WindSpeed_mph"                 
 [7] "WindDir"                        "AirTempF"                      
 [9] "RelHumidity"                    "BP_HG"                         
[11] "Rain_in"                        "H2O_TempC"                     
[13] "SpCond_mScm"                    "Salinity_ppt"                  
[15] "PH_mv"                          "Turbidity_ntu"                 
[17] "Chla_ugl"                       "BGAPC_CML"                     
[19] "BGAPC_rfu"                      "ODO_sat"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Filter

The function filter() works to select records (rows) based upon some criteria.

range( rice$AirTempF )
[1]  3.749 74.870

 

rice |>
  filter( AirTempF < 32 ) -> df
range( df$AirTempF )
[1]  3.749 31.990

Combinations of Filters

nrow( rice )
[1] 8199

 

rice |>
  filter( AirTempF > 32, 
          !is.na(PH), 
          Turbidity_ntu < 15) -> df
nrow(df)
[1] 1449

Arrange

We can sort entire data.frame objects based upon the values in one or more of the columns using the arrange() function.

rice$WindSpeed_mph[1]
[1] 3.87

 

rice |>
  arrange( WindSpeed_mph ) -> df 
df$WindSpeed_mph[1]
[1] 0

Reverse Sorting Order

To reverse the order, use the negative operator on the column name object in the function.

rice |>
  arrange( -WindSpeed_mph ) -> df 
df$WindSpeed_mph[1]
[1] 30.65

Cominations of Sorting

You can also sort using several criteria.

rice |>
  arrange( -WindSpeed_mph, WindDir ) -> df

Mutations

The mutate() function creates new columns of data.

class( rice$DateTime )
[1] "character"

Mutations

library( lubridate )
rice |>
  mutate( Date = mdy_hms(DateTime, tz = "EST") ) -> df
class( df$Date )
[1] "POSIXct" "POSIXt" 
summary( df$Date )
                 Min.               1st Qu.                Median 
"2014-01-01 00:00:00" "2014-01-22 08:22:30" "2014-02-12 16:45:00" 
                 Mean               3rd Qu.                  Max. 
"2014-02-12 16:45:00" "2014-03-06 01:07:30" "2014-03-27 09:30:00" 

Simultaneous Mutations

You can make several mutations in one call or you can pipe several mutation events at one time.

rice |>
  mutate( Date = mdy_hms(DateTime, tz = "EST"), 
          Month = month(Date, label = TRUE) ) -> df
summary( df$Month )
 Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec 
2976 2688 2535    0    0    0    0    0    0    0    0    0 

Summarize

The summarize() function derives inferences from the current data.frame and produces a new one.

rice |>
  summarize( `Air Temp` = mean( AirTempF), 
             `Water Temp` = mean(H2O_TempC, na.rm=TRUE))
# A tibble: 1 × 2
  `Air Temp` `Water Temp`
       <dbl>        <dbl>
1       38.8         5.53

Group & Summarize

The group_by() function allows you to arbitrarily pull together subset of data and prepare them to be worked on by something like summary().

rice |>
  mutate( Date = mdy_hms( DateTime, 
                          tz="EST"),
          Month = month( Date, 
                         abbr = FALSE, 
                         label=TRUE) ) |>
  group_by( Month ) |>
  summarize( `Air Temp` = mean( AirTempF), 
             `Water Temp` = mean( H2O_TempC, 
                                  na.rm=TRUE) ) 
# A tibble: 3 × 3
  Month    `Air Temp` `Water Temp`
  <ord>         <dbl>        <dbl>
1 January        34.7         3.68
2 February       39.7         5.29
3 March          42.6         7.96

Flows

Here are some strategies to consider.

  1. Do not think that you have to do the whole thing at once. I typically build up the workflow, one line at a time. Make sure the output from the previous line is what you think it should be then add the next one.
  2. Keep your code open and airy, it makes it easier to read and to catch any logical errors that may arise.
  3. You can pipe into a lot of different functions. In fact, any function that takes a data frame can be the recipient of a pipe. While developing a workflow, I will often pipe into things like head(), summary(), or View() to take a look at what is coming out of my workflow to make sure it resembles what I think it should look like.

Quoted Names

There are times when working with data that we may want to use reasonable names.

names( rice )
 [1] "DateTime"                       "RecordID"                      
 [3] "PAR"                            "WindSpeed_mph"                 
 [5] "WindDir"                        "AirTempF"                      
 [7] "RelHumidity"                    "BP_HG"                         
 [9] "Rain_in"                        "H2O_TempC"                     
[11] "SpCond_mScm"                    "Salinity_ppt"                  
[13] "PH"                             "PH_mv"                         
[15] "Turbidity_ntu"                  "Chla_ugl"                      
[17] "BGAPC_CML"                      "BGAPC_rfu"                     
[19] "ODO_sat"                        "ODO_mgl"                       
[21] "Depth_ft"                       "Depth_m"                       
[23] "SurfaceWaterElev_m_levelNad83m"

Quoted Names

If the names are set reasonably in the workflow, then they will be piped directly into tables and figures correctly. Here is a fully formed output from R.

rice |>
  mutate( Date = mdy_hms( DateTime, 
                          tz="EST"),
          Month = month( Date, 
                         abbr = FALSE, 
                         label=TRUE),
          AirTemp = (AirTempF-32) * 5/9 ) |>
  group_by( Month ) |>
  summarize( `Ambient Air` = mean( AirTemp), 
             `Water Surface` = mean( H2O_TempC, 
                                  na.rm=TRUE) ) -> df 

Tabular Output

So far, we’ve been showing the output as raw R output in the console or in Markdown. However, let’s take it up a notch and get a bit more formal and use a package that will make professional looking tables.

Here is the official GitHub repository for the project.

Using kable

Simply pipe the data into the kable() function.

library( knitr )

df |>
  kable()
Month Ambient Air Water Surface
January 1.498863 3.676346
February 4.302240 5.289107
March 5.888374 7.959093

Options to kable

df |>
  kable( digits = 2, 
         caption = "Air and water temperatures measured at the Rice Rivers Center.") -> kbl

kbl
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

For more options see ?kable

Making Fancier Tables

There is a second package, kableExtra, that adds a lot more functionality to the display of tabular output.

Here is a great overview of the various customization that is available.

Basic Style

library( kableExtra )

kbl |> 
  kable_styling()
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Paper Style

kbl |> 
  kable_paper()
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Classic Style

kbl |> 
  kable_classic( full_width = FALSE )
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Fancy Material Styles

kbl |> 
  kable_material( c("striped", "hover"), 
                  full_width = FALSE )
Air and water temperatures measured at the Rice Rivers Center.
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Quoted Names & More Fancy Headings

kbl |> 
  kable_paper( full_width=FALSE) |>
  add_header_above( c(" "=1, "Temperature (°C)" = 2))
Air and water temperatures measured at the Rice Rivers Center.
Temperature (°C)
Month Ambient Air Water Surface
January 1.50 3.68
February 4.30 5.29
March 5.89 7.96

Questions

If you have any questions, please feel free to post to the Canvas discussion board for the class, drop me an email, or show up for Office Hours.

Peter Sellers looking bored